class: center, middle, inverse, title-slide

.title[
# Class 2b: Review of concepts in Probability and Statistics
]
.author[
### Business Forecasting
]

---

<style type="text/css">
.remark-slide-content {
    font-size: 20px;
}
</style>

---
layout: false
class: inverse, middle

# Summarizing Data
## Summary Statistics

---
# Measures of Central Tendency
## Mean

- **Mean** represents the arithmetic average of the data.
- Sometimes called the expected value of the random variable, `\(E(X)\)`
- The population mean `\(\mu\)` is the sum of all observations divided by the total population size:
`$$\mu =E(X)=\frac{\sum_{i=1}^{N} x_i}{N}=\sum_{x\in X}P(X=x) \times x$$`
  - where `\(N\)` is the total population size, and `\(x_i\)` are individual data points.
- The sample mean, denoted as `\(\bar{x}\)`, is the sample equivalent:
`$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1+x_2+\dots+x_{n-1}+x_n}{n}$$`
  where `\(n\)` is the sample size.

---
## Mean

Intuitively, the mean is the balancing point of the distribution.

---
## Mean of a binary variable

What is the mean of a **binary variable**?
- A binary variable is a variable that takes the value 0 or 1
- For example: do you have diabetes (yes=1, no=0)

--

What is the intuitive interpretation of the mean of this variable?
- `\(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\)`
- `\(\bar{x} = \frac{1+0+0+\dots+0+1}{n}=\frac{n_{diabetes}}{n}=\hat{\mu}_{diabetes}\)`

--

It's the proportion of people with diabetes in the sample: mean(diabetes)= 0.11

---
## Weighted Mean
- In some scenarios, data points have different weights.
- For a dataset with weights `\(w_i\)` and values `\(x_i\)`, the weighted mean is:
`$$\small \text{Weighted Mean} = \frac{\sum_{i=1}^{n} w_i \cdot x_i}{\sum_{i=1}^{n} w_i}$$`
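
A minimal Python sketch of this formula, using the weights and values from the worked example on this slide. The function name `weighted_mean` is made up for illustration, and the input checks are an assumption about how one might want to guard the calculation.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Values and weights taken from the worked example on this slide.
values = [6, 8, 9, 4, 8]
weights = [0.2, 0.2, 0.15, 0.15, 0.3]

result = weighted_mean(values, weights)
print(round(result, 2))
```

With equal weights the function reduces to the ordinary sample mean.
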
The **weighted mean** is:
`\begin{align*} \small \bar{x} & =\frac{0.2\times 6+0.2\times 8+0.15 \times 9+ 0.15 \times 4+0.3 \times 8}{0.2+0.2+0.15+0.15+0.3} \end{align*}`

---
## Mean
- Is the mean always the right measure?

#### "Bill Gates walks into a bar"
- Suppose a group of people, including Bill Gates, walks into a bar.
- Let's say the net worth of everyone in the group is as follows:

.pull-left[
]
.pull-right[
The **mean** is:
`\begin{align*} \bar{x} & =\frac{10 + 20 + 30 + 40 + 50 + 60000}{6} \\ & = 10025 \\ \end{align*}`

The mean is seriously skewed by the outlier.
]

---
## Mean vs Median
<center>
<img src=mean_median.jpg width="800">
</center>

---
## Median
- **Median** represents the middle value when the data is sorted
- Half of the observations are below it, half are above it.
- For a dataset with odd size `\(n\)`, the median is the `\(\frac{n+1}{2}\)`-th value
- For even size `\(n\)`, it's the average of the `\(\frac{n}{2}\)`-th and `\(\left(\frac{n}{2}+1\right)\)`-th values.

.pull-left[
| Day | Number of Customers |
|-----|---------------------|
| 1   | 20                  |
| 2   | 18                  |
| 3   | 25                  |
| 4   | 22                  |
| 5   | 30                  |
| 6   | 21                  |
| 7   | 27                  |
]
.pull-right[
The dataset has `\(n=7\)` (odd) observations, so to find the median:
- Arrange the data in ascending order:
  - 18, 20, 21, 22, 25, 27, 30.
- The median is the `\(\frac{n+1}{2}\)`-th value, which is the 4th value.
- Thus, the median is 22.
]

---
### Let's look at the median weight in our population
- Mean: 72.66451
- Median: 70.7536

--

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/unnamed-chunk-5-1.png" width="100%" />

- Mean is dotted
- Median is dashed

---
### Median and outliers
I added a couple of observations to the right tail of the distribution
- Old Mean: 72.66, **New Mean: 77.05**
- Old Median: 70.75, **New Median: 70.95**

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/abc-1.png" width="100%" />

---
## Side note on the Mode

**Mode** is the most frequent value in the data
- Let's look at the distribution of age of people with diabetes
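
---
## Mean, median and mode in code

A small sketch of the three measures, using Python's standard `statistics` module. The customer counts come from the Median slide; the outlier day of 500 customers and the binary `has_diabetes` sample are hypothetical, added only to illustrate the ideas.

```python
from statistics import mean, median, mode

# Daily customers from the Median slide (n = 7).
customers = [20, 18, 25, 22, 30, 21, 27]
print(mean(customers))    # arithmetic average
print(median(customers))  # 4th value of the sorted data

# Hypothetical outlier day: the mean moves a lot, the median barely moves.
with_outlier = customers + [500]
print(mean(with_outlier))
print(median(with_outlier))  # average of the 4th and 5th sorted values

# Hypothetical binary sample: the mean is the share of ones,
# the mode is the most frequent value.
has_diabetes = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(mean(has_diabetes))
print(mode(has_diabetes))
```

- The median is robust to the outlier, the mean is not.
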
---
## Mode
<center>
<img src=mode.jpg width="400">
</center>

---
## Percentiles
<center>
<img src=Natalie.jpg width="300">
</center>

---
## Percentiles
<center>
<img src=t_swift2.jpg width="300">
</center>

---
## Percentiles
<center>
<img src=t_swift_01.jpg width="300">
</center>

---
## Percentiles
<center>
<img src=t_swift_0001.jpg width="300">
</center>

---
## Percentiles
<center>
<img src=WSJ.jpg width="800">
</center>

Credits: Wall Street Journal [Article Link](https://www.wsj.com/tech/personal-tech/spotify-wrapped-2023-taylor-swift-e303333d)

---
## Percentiles
- How much milk inventory do you need to keep in your Starbucks?

--

- What is the tradeoff between keeping too much and too little inventory?

--

- Suppose we want to have enough milk to cover sales on 95% of days

--

- To figure it out, let's look at the distribution of the daily use of milk

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/Sales_dist_figure-1.png" width="100%" />

---
## Percentiles
- Let `\(s_i\)` be the daily sales of milk
- We want to choose an amount `\(M\)` such that `\(P(s_i \leq M)=0.95\)`
  - That is, on 95% of days sales are less than or equal to `\(M\)`

--

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/Sales_dist_figure_with_shaded_region-1.png" width="100%" />

--

- What is this number?
- It's the 95th percentile of the distribution (274 liters)

---
## Percentiles
- *Percentiles* divide the ordered data into 100 equal parts.
- The `\(p\)`th percentile is a value such that `\(p\%\)` of the data are at or below it
  - `\(v_p\)` is such that `\(P(x_i \leq v_p)=p\%\)`
  - `\(v_{95}\)` is such that `\(P(x_i \leq v_{95})=95\%\)`

---
## Percentiles

--

- What is the height such that 75% of ITAM students are shorter than this height?

--

- What is the income level such that 25% of people in Mexico earn less than that level?

--

- What is the age such that 50% of people die before that age?
---
## Percentiles
<center>
<img src=Exam_q_top5.png width="800">
</center>

---
## How to find it in a sample
1. Arrange the data in ascending order

--

2. Find which observation corresponds to the relevant percentile
    - Formula: `\(i = \left(\frac{p}{100}\right)(n+1)\)`
    - Example: To find the 95th percentile in a sample of 1000 observations, we look at observation `\(i = \left(\frac{95}{100}\right)(1000+1)=950.95\)`

--

3. If `\(i\)` is an integer, the value of the `\(i\)`th observation is your percentile
4. If it's not, take the average of the observations at `\(i\)` rounded down and `\(i\)` rounded up
    - In our example it would be the average of the 950th and 951st observations

---
## Or use the CDF
- `\(ECDF(v)=P(x_i \leq v)\)`
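
---
## Percentiles in code

A minimal sketch of the `\((n+1)\)` rule and of the empirical CDF in plain Python. The function names `percentile_n_plus_1` and `ecdf`, the sample `1, 2, ..., 1000`, and the clamping of the index to the range `1..n` are assumptions made for illustration. Library routines such as `numpy.percentile` interpolate by default, so they can return slightly different values.

```python
import math


def percentile_n_plus_1(data, p):
    """p-th percentile via the position rule i = (p / 100) * (n + 1)."""
    xs = sorted(data)                   # step 1: ascending order
    n = len(xs)
    if n == 0:
        raise ValueError("data must not be empty")
    i = p * (n + 1) / 100               # step 2: 1-based position
    lo = min(max(math.floor(i), 1), n)  # i rounded down, kept within 1..n
    hi = min(max(math.ceil(i), 1), n)   # i rounded up, kept within 1..n
    # Steps 3 and 4: if i is an integer, lo == hi and the average
    # is just that observation; otherwise average the two neighbours.
    return (xs[lo - 1] + xs[hi - 1]) / 2


def ecdf(data, v):
    """Share of observations less than or equal to v."""
    return sum(x <= v for x in data) / len(data)


# Hypothetical sample of 1000 observations: 1, 2, ..., 1000.
sample = list(range(1, 1001))
v95 = percentile_n_plus_1(sample, 95)   # averages the 950th and 951st values
print(v95)
print(ecdf(sample, v95))                # share of the sample at or below v95
```
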
---
## Common values
- **Median** - 50th percentile
  - half of the values are below the median
- **Quartiles** - 25th, 50th and 75th percentile.
  - How poor is the poorest quartile of society?
  - Their income is below the 25th percentile

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/Image_with_shaded_area_Qtiles-1.png" width="100%" />

---
- **Deciles** - 10th, 20th, ... 90th percentile
  - How bad does pollution get in CDMX on the 10% most polluted days?
  - On the 10% most polluted days, the pollution level is at or above the 9th decile.

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/Image_with_shaded_area_Deciles-1.png" width="100%" />

---
### Example with data
Here is data on the distribution of how many views various TikTok videos have.
- What is the 1st decile?
- What is the 95th percentile?
--

- Index for the first decile is: `\(i = \left(\frac{10}{100}\right)(200+1)=20.1\)`
  - The first decile is the average of the 20th and 21st observations
- Index for the 95th percentile is: `\(i = \left(\frac{95}{100}\right)(200+1)=190.95\)`
  - The 95th percentile is the average of the 190th and 191st observations

---
### Exercises:
- Review Exercises:
  - PDF 2: 3,4,5,6,
- Homework
  - Lista 00.1: 3